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In this paper we extend our previous work on the stochastic block model, a commonly used 
Oh' generative model for social and biological networks, and the problem of inferring functional groups 

or communities from the topology of the network. We use the cavity method of statistical physics to 
obtain an asymptotically exact analysis of the phase diagram. We describe in detail properties of the 
detectability/undetectability phase transition and the easy /hard phase transition for the community 
detection problem. Our analysis translates naturally into a belief propagation algorithm for inferring 
the group memberships of the nodes in an optimal way, i.e., that maximizes the overlap with the 
underlying group memberships, and learning the underlying parameters of the block model. Finally, 
| we apply the algorithm to two examples of real-world networks and discuss its performance. 
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I. INTRODUCTION 

Many systems of interest consist of a large number of nodes (e.g. atoms, agents, items, Boolean variables) and 
sometimes the only information we can observe about the system are connections, or edges, between pairs of nodes. 
The resulting structure is usually called a network. A naturally arising question is whether we are able to understand 
something about the underlying system based purely on the topology of the network. In many situations different 
nodes have different functions. For instance each of the nodes may belong to one of q groups, often called functional 
modules or communities, and the structure of the network may depend in an a priori unknown way on the group 
memberships. In general we are interested in finding how the network's function and topology affect each other. 
Hence an interesting and practically important question is whether, based on the known structure of the network, 
it is possible to learn in what way the group memberships influenced the structure of the network, and which node 
belongs to which group. 

One well-studied example of the above setting is the so-called community detection problem, for a review see e.g. 
In the study of complex networks, a network is said to have community structure if it divides naturally into groups of 
nodes with denser connections within groups and sparser connections between groups. This type of structure, where 
nodes are more likely to connect to others of the same type as in a ferromagnet, is called assortative. The goal is to 
detect the communities and to estimate, for instance, the expected number of connections within and outside of the 
community. 

In other cases the network can be disassortative, with denser connections between different groups than within 
groups. For instance, a set of predators might form a functional group in a food web, not because they eat each other, 
but because they eat similar prey. In networks of word adjacencies in English text, nouns often follow adjectives, 
but seldom follow other nouns. Even some social networks have disassortative structure: for example, some human 
societies are divided into moieties, and only allow marriages between different moieties. 

In research on networks, generative models of random graphs often provide useful playgrounds for theoretical 
ideas and testing of algorithms. The simplest such model is the Erdds-Renyi random graph [2J, where every pair of 
nodes is connected independently with the same probability. In this paper we study in detail the most commonly 
used generative model for random modular networks, the stochastic block model, which generalizes the Erdos-Renyi 
random graph by giving each pair of nodes a connection probability depending on what groups they belong to. We 
extend our previous analysis from Q , and provide a rather complete and asymptotically exact analysis of the phase 
diagram of this model. We focus on the question of whether we can infer the original group assignment based on the 
topology of the resulting network, and learn the unknown parameters that were used for generating the network. We 
use the cavity method developed in the statistical physics of spin glasses [1] to evaluate the phase diagram. 

Our results naturally translate into a message-passing algorithm called belief propagation Q, that we suggest 
as a heuristic tool for learning parameters and inferring modules in real networks. Our results are exact in the 
thermodynamic limit, but the algorithm works well even on small networks as long as they are well-described by 
the generative model. Our theory and algorithms are straightforwardly generalizable to other generative models, 
including hierarchical module structures Q, overlapping modules Q, or degree-corrected versions of the stochastic 
block model [|[. In the present work we focus on graphs that do not contain any multiple edges or self-loops, but the 
generalization would be straightforward. 

The paper is organized as follows. In Section Hi Al we define the stochastic block model. In III Bl and III CI we review 
the Bayesian theory for optimal inference and learning in the present context. In Section III Dl we discuss in more 
detail the relation between our work and previous work on community detection, and summarize the advantages of 
our approach. In Section IlIII we present the cavity method for asymptotic analysis of the model, and the associated 
belief propagation algorithm for parameter learning and inference of the group assignment. In Section HVl we analyze 
the phase diagram and describe in detail the different phase transitions introduced in Q . Finally, in Section [V] we 
discuss applications of our approach to real-world networks. 

II. THE STOCHASTIC BLOCK MODEL 

A. Definition of the model 

The stochastic block model is defined as follows. It has parameters q (the number of groups), {n a } (the expected 
fraction of nodes in each group a, for 1 < a < q), and a q x q affinity matrix p a b (the probability of an edge between 
group a and group b). We generate a random directed graph G on N nodes, with adjacency matrix Aij — 1 if there 
is an edge from i to j and otherwise, as follows. Each node i has a label U G {1, . . . ,q}, indicating which group it 
belongs to. These labels are chosen independently, where for each node i the probability that f, = a is n a . Between 
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each pair of nodes we then include an edge from i to j with probability pt it Ui setting Ay = 1, and set Ay = 
with probability 1 — Pt^u. We forbid self-loops, so Aji = 0. 

We let N a denote the number of nodes in each group a. Since iV a is binomially distributed, in the limit of large N we 
have N a /N = n a with high probability. The average number of edges from group a to group b is then M ab = p a bN a N b , 
or M aa = Paa,N a (N a — 1) if a = b. Since we are interested in sparse graphs where p a b — 0(1/ N), we will often work 
with a rescaled affinity matrix c ab — Np ab . In the limit of large N , the average degree of the network is then 

c = ^2c ab n a n b . (1) 

a, 6 

We can also consider the undirected case, where Ay, p ab , and c ab are symmetric. The average degree is then 

n 2 

= ^2 Cabn a n b + ^2 c aa ~/r ■ (2) 



c 

a<b 



Several special cases of this stochastic block model are well known and studied in the literature. Without the aim 
of being exhaustive let us mention three of them. We stress, however, that from a mathematical point of view many of 
our results are non-rigorous and hence rigorous proofs of our conjectures would be a natural and important extension 
of our work. 

• A common benchmark for community detection is the "four groups" test of Newman and Girvan 9] , in which 
a network is divided into four equally-sized groups: n a = 1/4, the average degree is c = 16, and 



Cab = /u (3) 




where Cj n > c ou t so that the community structure is assortative. By varying the difference between c- m and c out , 
we create more or less challenging structures for community detection algorithms. It is usually expected in the 
literature that as N — > oo it is possible to find a configuration correlated with the original assignment of nodes 
to the four groups as soon as the average in-degree (number of edges going to the same group) is larger than 
4, or c/q in general, and that a failure to do so is due to an imperfection of the algorithm. In contrast, from 
our results (see e.g. Fig. [TJ it follows that in the limit N — » oo, unless the average in-degree is larger than 7 no 
efficient algorithm will be better than a random choice in recovering the original assignment on large networks. 
At the same time we design an algorithm that finds the most likely assignment for each node if the in-degree is 
larger than 7. 

• Planted graph partitioning, a generalization of the above example, is well known in the mathematics and 
computer science literature. We have n a = 1/q, and p ab = p m if a = b and p ou t if a ^ b. We again assume an 
assortative structure, so that p m > p ou t, but now we allow the dense case where p ln , p ou t might not be 0(1 /N). 
A classical result [l(| shows that for p in — p out > 0(\ogN/N) the planted partition is with high probability 
equal to the best possible partition, in terms of minimizing the number of edges between groups. Another 
classical result [ll[ shows that the planted partition can be easily found as long as p; n — p ou t > 0(N^ 1 / 2+t ) 
for arbitrarily small e. In this work we do not aim at finding the best possible partition nor the planted 
partition exactly. Rather, we are interested in conditions under which polynomial-time algorithms can find a 
partition that is correlated with the planted partition. In Section IIV Al we will show that this is possible when 
Pin — Pout > \/ qPm + q(q — l)Pout/VN. We will also argue that, depending on the values of the parameters, this 
task may be impossible, or exponentially hard, if this condition is not satisfied. 

• In the planted coloring problem we again have n a = 1/q and p ab = p ou t for a ^ b, but p aa — so that there 
are no edges between nodes in the same group. The average degree of the graph is then c = (q — l)p QU tN/q. 
A rigorous analysis of the problem [l2j shows that for c > 0(q 2 ) it is possible to find in polynomial time a 
proper coloring correlated strongly with the planted coloring. The cavity method result derived in [l3| shows 
that configurations correlated with the planted coloring are possible to find if and only if c > c q , where the 
critical values c q are given in [l3| . For large q we have c q = 0(2q\ogq). However, it is known how to find such 
colorings in polynomial time only for c > (q— l) 2 [lJj, and there are physical reasons to believe that c — (q— l) 2 
provides a threshold for success of a large class of algorithms, as we will explain later. 

We stress that in this paper we are mostly interested in the large N behavior of the generative model and hence in 
the text of the paper we will neglect terms that are negligible in this limit. For instance, we will write that the number 
of edges is M = cN/2 even if for finite size N one typically has fluctuations around the average of size 0(\/~N). 
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Now assume that a network was generated following the stochastic block model described above. The resulting 
graph G is known, but the parameters q, n a , p a ^ and the labeling ti are unknown. In this paper we address the two 
following questions: 

(i) Given the graph G, what are the most likely values of the parameters q, n a , p a b that were used to generate the 
graph? We will refer to this question as parameter learning. 

(ii) Given the graph G and the parameters q, n a , p a b, what is the most likely assignment of a label (group) to a 
given node? In particular, is this most likely assignment better than a random guess? How much better? We 
will refer to this to as inferring the group assignment. 

In order to be able to give a quantitative answer to the second question we define agreement between the original 
assignment {ti} and its estimate {qi} as 

A ({ti}, {qi}) = max — KMn) > ( 4 ) 

i 

where w ranges over the permutations on q elements. We also define a normalized agreement that we call the overlap, 

Q({U},{ qi }) = max ^^"^^ . (5) 
t 1 — max a n a 

The overlap is defined so that if ti — qi for all i, i.e., if we find the exact labeling, then Q = 1. If on the other hand 
the only information we have are the group sizes n a , and we assign each node to the largest group to maximize the 
probability of the correct assignment of each node, then Q = 0. We will say that a labeling {qi} is correlated with the 
original one {ti} if in the thermodynamic limit N — ¥ oo the overlap is strictly positive, with Q > bounded above 
some constant. 

Our main results provide exact answers to the above questions in the thermodynamic limit for sparse networks, i.e. 
when c a b — Np a b = 0(1) and n a — 0(1) are constants independent of size as N — ► oo. Many real- world networks 
are sparse in that the total number of edges grows only as O(N) rather than 0(N 2 ) (although we do not address 
networks with heavy-tailed degree distributions). In the dense case where the av erag e degree diverges as N — > oo, 
learning and inference are algorithmically easier, as was previously realized in [ill Il4| and as will also become clear 
from the large-degree limit of our results. 



B. Optimal inference of the group assignment 

The probability that the stochastic block model generates a graph G, with adjacency matrix A, along with a given 
group assignment {qi}, conditioned on the parameters 9 = {q, {n a }, {p a b}} is 

P(G, {qi} \6) = J] [pfy (1 - Pqi , qj y- A -} II ^ , (6) 

iyij i 

where in the undirected case the product is over pairs i < j. Note that the above probability is normalized, i.e. 

{q } -f^' ili} I 0) = 1- Assume now that we know the graph G and the parameters 6*, and we are interested in 
the probability distribution over the group assignments given that knowledge. Using Bayes' rule we have 

In the language of statistical physics, this distribution is the Boltzmann distribution of a generalized Potts model 
with Hamiltonian 

H({ gi }|G,e) = -^togn fc -2[^Iogc 9 ^ + (l-^)log(l-^-)] , (8) 

where the sum is over pairs i < j in the undirected case. The labels qi are Potts spins taking one of the q possible 
values, and the group sizes n qi (or rather their logarithms) become local magnetic fields. In the sparse case c a b — 0(1), 
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there are strong O(l) interactions between connected nodes, Ay — 1, and weak 0(1/N) interactions between nodes 
that are not connected, Ay = 0. Recall that the Boltzmann distribution at unit temperature is 

-H({ qi }\G,0) 

(i({q i }\G,d) = P({q i }\G,6) = - _ g({gj}|G|9) , (9) 

where the denominator is the corresponding partition function 

Z(G,8) = £V*«*>IG.e). (10) 
{<;.} 

Note that we define H({qi}\G, 9) = — log P(G, {qi} \ 9) — MlogN to keep the formally useful property that the energy 
is extensive, i.e., proportional to N, in the thermodynamic limit. In statistical physics it is usual to work with the 



free energy density 



Since the energy 1(5)1 is extensive, the free energy density has a well defined finite thermodynamic limit f(G, 9). 

It is useful to notice that if the parameters q, {n a }, {c a b} are known then the Boltzmann distribution §§§ is asymp- 
totically uniform over all configurations with the right group sizes and the right number of edges between each pair 
of groups, N a /N = n a and M a b/N = CabUani,. The original, correct group assignment is just one of these configura- 
tions. In a statistical physics sense, the original group assignment is an equilibrium configuration for the Boltzmann 
distribution, rather than its ground state. In particular, if we were presented with the original assignment {ti} and a 
typical assignment {qi} sampled according to the Boltzmann distribution, we would be unable to tell which one was 
correct. 

The marginals of the Boltzmann distribution, i.e. the probabilities Vi(qi) that a node i belongs to a group qi, are 

Vi(Qi)= viUih^hli)- (12) 

Our estimate q* of the original group assignment assigns each node to its most-likely group, 

q* = argmax g . . (13) 

If the maximum of fi{qi) is not unique, we choose at random from all the qi achieving the maximum. We refer to 
this method of estimating the groups as marginalization; in Bayesian inference q* is called the maximum posterior 
marginal. Standard results in Bayesian inference (e.g. 15]) show that it is in fact the optimal estimator of the original 
group assignment {ti} if we seek to maximize the number of nodes at which ti = q* . In particular, the ground state 
of the Hamiltonian ||5J), i.e. the configuration {qf s } that maximizes fj,({qi}), has in general a slightly smaller overlap 
with the original assignment than {<?*} does. 

Note that the Boltzmann distribution is symmetric with respect to permutations of the group labels. Thus the 
marginals over the entire Boltzmann distribution are uniform. However, if this permutation symmetry is broken in the 
thermodynamic limit, so that each permutation corresponds to a different Gibbs state, we claim that marginalization 
within one of these Gibbs states is the optimal estimator for the overlap defined in J5"]). where we maximize over all 
permutations. 

One of the advantages of knowing the marginals of the Boltzmann distribution © is that in the thermodynamic 
limit we can evaluate the overlap Q({U}, {q%}) even without the explicit knowledge of the original assignment {U}. 
It holds that 

Qmargin = hm ^ = hm Q({U},{Qi I) • ( 14 ) 

JV-s-oo 1 — max a n a jv-s-oo 

The overlap Q marg i n measures the amount of information about the original assignment that can be retrieved from 
the topology of the network, given the parameters 9. The marginals Vi(qi) can also be used to distinguish nodes that 
have a very strong group preference from those that are uncertain about their membership. 

Another interesting property is that two random configurations taken from the Boltzmann distribution ([9]) have 
the same agreement as a random configuration with the original assignment ti, i.e. 

lim imaxyVi(7r(*i)) = lim — VY] Vi{af , (15) 
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where tt again ranges over the permutations on q elements. This identity holds only if the associated Boltzmann 
distribution was computed using the correct parameters 9. In the statistical physics of spin glasses, the property (fT5)l 
is known as the equality between the Edwards- Anderson overlap and the magnetization, and holds on the Nishimori 
line. Here the knowledge of the actual parameter values 9 is equivalent to the Nishimori condition being satisfied; for 
more details see Section Hi CI 



C. Learning the parameters of the model 

Now assume that the only knowledge we have about the system is the graph G. The general goal in machine 
learning is to learn the most probable values of the parameters 9 of an underlying model based on the data known to 
us. In this case, the parameters are 9 — {q, {n a }, {c ab }} and the data is the graph G, or rather the adjacency matrix 
Aij. According to Bayes' rule, the probability P(9 1 G) that the parameters take a certain value, conditioned on G, is 
proportional to the probability P(G 1 9) that the model with parameters 9 would generate G. This in turn is the sum 
of P(G, {qi} 1 9) over all group assignments {qi}- 

p ^ g ^y§) p ^ = j^)T. p ( G > {«> i d ) • ( 16 ) 

In Bayesian inference, P[6 \ G) is called the posterior distribution. The prior distribution P{9) includes any graph- 
independent information we might have about the values of the parameters. In our setting, we wish to remain perfectly 
agnostic about these parameters; for instance, we do not want to bias our inference process towards assortative 
structures. Thus we assume a uniform prior, i.e., P{9) = 1 up to normalization. Note, however, that since the sum 
in (|T5| typically grows exponentially with N , we could take any smooth prior P{0) as long as it is independent of N; 
for large N, the data would cause the prior to "wash out," leaving us with the same posterior distribution we would 
have if the prior were uniform. 

Thus maximizing P{9 \ G) over 9 is equivalent to maximizing the partition function (jlOp over 9, or equivalently 
minimizing the free energy density defined in Eq. (ITT|) of the Potts model Q as a function of 9. As in the saddle- 
point method, if the function f(9) has a non-degenerate minimum, then in the thermodynamic limit this minimum 
is achieved with high probability at precisely the values of the parameters that were used to generate the network. 
In mathematical terms, if we call 9* the original parameters and 9 the ones minimizing f{&), then for all e > the 
probability limjv_i.oo Pr[|#* — 6\ < e] = 1. It is also important to note that due to the self-averaging nature of the 
model, as N —¥ oo the free energy depends only on 9 and not on the precise realization of the network. We will 
study the free energy in detail in the next section, but for the moment suppose that it indeed has a non-degenerate 
minimum as a function of 9. In that case we can learn the exact parameters: the number of groups g, their sizes {n a }, 
and the affinity matrix {c ab }. 

In some cases, rather than minimizing f(9) directly, it is useful to write explicit conditions for the stationarity of 
f(9). Taking the derivative of f(9) with respect to n a for 1 < a < q, subject to the condition n °- ~ 1> anc ^ setting 
these derivatives equal to zero gives 



w2> 



v ^ ■/..»/ = = n a Va = l,...,g, (17) 

i 

where by (/({%})) — ^2{ qi } /({'frjOMte}!^ ^) we denote the thermodynamic average. Thus for each group a, the 
most likely value of n a is the average group size; an intuitive result, but one that deserves to be stated. Analogously, 
taking the derivative of f(9) by the affinities c ab gives 

£ (6 qi<a 6 qj , b )=^-=c ab Va,6. (18) 
Nn a n b ..t—i^ Nn a n b 

Meaning that the most likely value of c ab is proportional to the average number of edges from group a to group b. 
More to the point, the most likely value of p ab = c ab /N is the average fraction of the N a N b potential edges from group 
a to group b that in fact exist. In the undirected case, for a — b we have 

NnJ/2 £ = = c - Va - ( 19 ) 

al (i,j)eE a/ 

The stationarity conditions (|17I I19[) naturally suggest an iterative way to search for the parameters 9 that minimize 
the free energy. We start with arbitrary estimates of 9 (actually not completely arbitrary, for a more precise statement 
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see subsequent sections), measure the mean values (N a ) and (M a b) in the Boltzmann distribution with parameters 
9, and update 9 according to ((T71 fT^j) . We then use the resulting 9 to define a new Boltzmann distribution, again 
measure (N a ) and {M ab ), and so on until a fixed point is reached. 

In statistical physics, the stationarity conditions (|17f[TO| can be interpreted as the equality of the quenched and 
annealed magnetization and correlations. In models of spin glasses (e.g. |15| ) they are sometimes referred to as the 
Nishimori conditions. This iterative way of looking for a maximum of the free energy is equivalent to the well-known 
expectation-maximization (EM) method in statistics [l6j . 

Note that if the conditions P7HT5)) are satisfied, the average of the Hamiltonian in the Boltzmann distribution © 
can be easily computed as 

j^{H) = e = - ^2 n a log n a -^2 n a n b c ab log c ah + c , (20) 

a a,b 

where the term c comes from the weak 0(1/N) interactions between disconnected pairs of nodes. In the undirected 
case, 

jj{H) = e = - n a log n a - ^ n a n b c ab log c ab - ^c oa logc Qa + | . (21) 

a a<b a 

Note that as usual in statistical physics the free energy is the average of the energy minus the entropy, f = e — s 
(where here the "temperature" is unity) . The entropy is the logarithm of the number of assignments that have the 
corresponding energy. In the Bayesian inference interpretation the entropy counts assignments that are as good as the 
original one (i.e. they have the same group sizes and the same number of edges between each pair of groups). This 
entropy provides another useful measure of significance of communities in the network, and was studied numerically 
for instance in [l7T |. 



D. Our contribution and relation to previous work 



The Bayesian approach presented in the previous two sections is well known in machine learning and statistics. 
However, it is also well known that computing the partition function and the marginal probabilities of a model defined 
by the Hamiltonian © or computing the averages on the left-hand sides of ([T7l[T^|) is a computationally hard task. 
In general, and our model is no exception, there is no exact polynomial-time algorithm known for these problems. 

A standard tool of statistical physics and statistical inference is to approximate thermodynamic averages using 
Monte Carlo Markov chains (MCMC), also known as Gibbs sampling. Any Markov chain respecting detailed balance 
will, after a sufficient number of steps, produce sample configurations according to the Boltzmann distribution (0). 
This can then be used to compute averages like (N a ) and (M ab ), or even the free energy if we integrate over a 
temperature-like parameter. A central question, however, is the equilibration time of these Markov chains. For very 
large networks, a running time that grows more than linearly in N is impractical. 

The running time of Gibbs sampling is one reason why the Bayesian approach is not that widely used for community 
detection. Exceptions include specific generative models for which the partition function computation is tractable [la . 
[l9j . the work of (20j where a variational approximation to bound the partition function (or related expectations) 
was used, the work of @ where a more elaborate generative model is used to infer hierarchies of communities and 
subcommunities. Another exception is [2lj . where Gibbs sampling is used to estimate the mutual information between 
each node and the rest of the graph in order to perform "active learning." 

The main contribution of this work is a detailed, and in the thermodynamic limit exact, analysis of the stochastic 
block model and its phase diagram with the use of the cavity method [j, [22j developed in the theory of spin glasses. 
We show that there is a region in the phase diagram, i.e., a range of parameters 9, where inference is impossible 
even in principle, since the marginals of the Boltzmann distribution yield no information about the original group 
assignment. There is another region where inference is possible but, we argue, exponentially hard. Finally, there is 
a region where a belief propagation algorithm |5[ , that emerges naturally from the cavity method, computes the free 
energy density and corresponding expectations exactly in the thermodynamic limit, letting us infer the parameters 
optimally and in linear time. As we show later, this algorithm also performs very well for networks of moderate size, 
and it is useful for real- world networks as well. 

The boundaries between these phases correspond to well-known phase transitions in the statistical physics of spin 
glasses: namely, the dynamical transition or the reconstruction threshold, see e.g. @,[23|; the condensation transition 
or the Kauzmann temperature (25l. [26j; and the easy/hard transition in planted models introduced in (l3| . There is 
also a close relation between our approach and the optimal finite-temperature decoding [TH, [27lj29| , and the statistical 
mechanics approach to image processing [30| . 
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In fact, the theory of spin glasses also leads to the conclusion that Gibbs sampling works in linear time, i.e., its 
equilibration time is linear in N, in the same region where belief propagation works. However, belief propagation is 
considerably faster than Gibbs sampling at finding the marginals, since belief propagation produces marginals directly 
while Gibbs sampling requires us to measure them by taking many independent samples. 

Belief propagation was previously suggested as an algorithm for detecting communities in (3l| . However, in that 
work its performance was not studied systematically as a function of the parameters of the block model. Moreover, 
there the parameters 9 were fixed to an assortative community structure similar to the planted partition model 
discussed above, rather than being learned. 

We argue that our Bayesian approach, coupled with belief propagation to compute marginals and estimate the 
free energy, is optimal for graphs generated from the stochastic block model. For such random networks it possesses 
several crucial advantages over the methods that are widely used for community detection in the current literature, 
without being computationally more involved. For a review of methods and results known about community detection 
see for instance [l| and references therein. Let us list some of the advantages: 

• General modules, not just assortative ones: It is fair to say that when there are well-separated assortative 
communities in the network, the community detection problem is relatively easy, and many efficient algorithms 
are available in the literature; see for instance [32j for a comparative study. However, for block models where 
the matrix of affinities p a b is not diagonally dominant, i.e., where functional groups may be disassortative or 
where edges between them are directed, the spectrum of available algorithms is, so far, rather limited. 

• No prior knowledge of parameters needed: Our method does not require any prior knowledge about the 
functional groups or how they tend to be connected. It is able to learn the number of groups q, their sizes n a , 
and the affinity matrix c a b- 

• Asymptotically exact for the stochastic block model: Unlike any other known method, for networks 
that are created by the stochastic block model, in the limit N —> oo our algorithm either outputs the exact 
parameters q, n ai c a b, or halts and reports that learning is impossible or algorithmically hard. In the second case 
we are very confident that no other method will be able to do better in terms of the overlap parameter ([5]) , and 
we will explain the reasons for this conjecture later in the text. 

• Detects when a network has no communities: While people may differ on the precise definition of 
community structure and how to find it, they presumably agree that a purely random graph, such as an Erdos- 
Renyi graph where all pairs of nodes are connected with the same probability, has no community structure to 
discover. However, the vast majority of community detection algorithms do in fact find illusory communities in 
such graphs, due to random fluctuations. For instance, sparse random graphs possess bisections, i.e., divisions of 
the nodes into two equal groups, where the number of edges between groups is much smaller than the number of 
edges within groups. To give a specific example, nodes in a large 3-regular random graph can be bisected in such 
a way that only about 11% of all edges are between the groups [33f . Popular measures of community significance 
such as modularity [sj do not take this fact into account. In contrast, our method naturally recognizes when a 
network does not in fact contain any modular structure. 

• Better measures of significance: More generally, most known methods for community detection aim at 
providing one assignment of nodes to groups; physically speaking, they look for a single ground state. However, 
there are usually a large number of group assignments that are comparable to each other according to various 



quality measure, see e.g. 17]. Our method provides all the thermodynamically significant group assignments 
"at once," by providing the marginal probabilities with which each node belongs to a given community. It 
also provides the entropy of the good assignments, giving us a measure of how non-unique they are. This kind 
of information is much more informative than a single group assignment, even if we can find the "best" one. 
Physically speaking, the Boltzmann distribution tells us more about a network than the ground state does. 

We stress, however, that all of the above is true only for networks generated from the stochastic block model, or for 
real world networks that are well described by this generative model. Indeed, many of the other methods suggested 
in the literature for community detection also implicitly assume that the network under study is well described by 
a similar model. For many real networks, this is not true; for instance, as pointed out in [§[, the block model 
performs poorly on networks where communities include nodes with a very broad range of degrees. On the other 
hand, the Bayesian approach and belief propagation algorithm presented here can be generalized straightforwardly to 
any generative model where the likelihood Q can be written as a product of local terms, such as the degree-corrected 
block model suggested in Q. Thus our methods can apply to a wide variety of generative models, which take various 
kinds of information about the nodes and edges into account. We leave these generalizations for future work. 
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III. CAVITY METHOD AND THE STOCHASTIC BLOCK MODEL 



In this section we derive the cavity equations and the associated belief propagation (BP) algorithm for computing 
the marginal probabilities (fT2")l and the average values (fTTKK)]) we need to learn the parameters 0. When applied 
in the thermodynamic limit TV — > oo, our analysis lets us describe the phase diagram of the learning and inference 
problems, showing for which parameters these problems are easy, hard, or impossible. 



A. Cavity equations: marginals and free energy 



In the literature the so-called replica symmetric cavity method is often understood in terms of the BP algorithm and 
its behavior in the thermodynamic limit with properly chosen initial conditions. From a physics point of view, the 
BP algorithm is an iterative way to compute the partition function by neglecting correlations between the neighbors 
of node i while conditioning on the "spin" or label of node i. Such correlations are non-existent if the network 
of interactions is a tree. On networks that are locally treelike, if correlations decay rapidly with distance these 
correlations are negligible, making BP asymptotically exact. 

Note that in our case the "network of interactions" is fully connected, since in the Hamiltonian (jHJ there are weak 
interactions even along the non-edges, i.e., between pairs of nodes that are not connected. However, as we will see 
these weak interactions can be replaced with a "mean field," limiting the interactions to the sparse network. 

The belief propagation equations are derived from a recursive computation of the partition function with the 
assumption that the network of interactions is a tree. The asymptotic exactness of the replica symmetric cavity 
method (BP equations) is then validated by showing that the correlations that have been neglected are indeed 
negligible in the thermodynamic limit. 

To write the belief propagation equations for the Hamiltonian we define conditional marginals, or messages, 
denoted ip l q ^ J ■ This is the marginal probability that the node i belongs to group qi in the absence of node j. The 
cavity method assumes that the only correlations between i's neighbors are mediated through i, so that if i were 
missing — or if its label were fixed — the distribution of its neighbors' states would be a product distribution. In that 
case, we can compute the message that i sends j recursively in terms of the messages that i receives from its other 
neighbors k: 



u 



1 



n 

kedi\j 



. tk 



k \ N J 



l-A iU 



4>. 



(22) 



where di denotes i's neighborhood, and Z l ~^ J is a normalization constant ensuring J2t V't- 3 = !• We apply ([22]) 
iteratively until we reach a fixed point {tp l q ~* J }- Then the marginal probability is estimated to be fi(ti) = ip^, where 
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These equations are for the undirected case. In a directed network, i would send and receive messages from both its 
incoming and outgoing neighbors, and we would use ct k ^t i or ct i ,t k for incoming and outgoing edges respectively. 

Since we have nonzero interactions between every pair of nodes, we have potentially N(N — 1) messages, and 
indeed (|2"2"|) tells us how to update all of these for finite N. However, this gives an algorithm where even a single 
update takes 0(N 2 ) time, making it suitable only for networks of up to a few thousand nodes. Happily, for large 
sparse networks, i.e., when N is large and c a b = 0(1), we can neglect terms of sub-leading order in TV. In that case 
we can assume that i sends the same message to all its non-neighbors j, and treat these messages as an external field, 
so that we only need to keep track of 2M messages where M is the number of edges. In that case, each update step 
takes just O(M) = O(N) time. 

To see this, suppose that £ E. We have 
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Hence the messages on non-edges do not depend to leading order on the target node j. On the other hand, if (i,j) € E 
we have 
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The belief propagation equations can hence be rewritten as 

1 



n ti e 
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where we neglected terms that contribute 0(1/ N) to ^ l_ * J ', and defined an auxiliary external field 



(26) 



(27) 



In order to find a fixed point of Eq. (|26|) in linear time we update the messages , recompute V> 3 , update the field 
by adding the new contribution and subtracting the old one, and repeat. The estimate of the marginal probability 
Vi{ti) is then 
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When the cavity approach is asymptotically exact then the true marginal probabilities obey Vi(ti) — 
with the original group assignment is then computed from (|14j). Introducing 
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(28) 

The overlap 

(29) 
(30) 
(31) 



we can write the BP estimate for the free energy, also called the Bethe free energy, in the thermodynamic limit as 



/Bp(Un4,{ Cob }) = -l£iog^ + l £ lo s^ 



c 

2 " 



(32) 



where c is the average degree given by ([T]) and the third term comes from the edge-contribution of non-edges (i.e. 

V JoglZ'-Mi. 

When deriving the BP equations (|2"6"|) . the BP estimates of marginals (|28p. and the Bethe free energy ([3^]) . we saw 
that even though the original graph of interactions is fully connected we end up with BP equations on the edges of 
the original network. This network is locally treelike, so standard assumptions about correlation decay suggest that 
the BP equations are then asymptotically exact. 

In the statistical physics of spin glasses the equations presented in this section are called the replica symmetric cavity 
equations. A large amount of work has been devoted to understanding under what circumstances these equations 
give asymptotically exact results for the marginals and the free energy, and when they do not 0, [34| . All known 
cases when the replica symmetric solution of a model like © is not correct on a randomly generated network can be 
divided into two classes: a static spin glass phase with spin glass susceptibility 



XSG = 



-fi E E (** > h ) ~ v i ) v j )] S 



(33) 
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diverging with the system size, or a first order phase transition into a dynamically non-attractive "ferromagnetic" 
phase, see e.g. [35| . 

It is a general property of inference problems that at the correct value of the parameters 9 the static spin glass 
phase never exists. Indeed, when the parameters 6 are the actual ones from which the graph was generated, the 
system satisfies the so-called Nishimori condition, and a very general result is that there is no static spin glass phase 
on the Nishimori line [29j, l36j . 

On the other hand, the first-order phase transition to a dynamically non-attractive ferromagnetic phase cannot be 
avoided. This phase is easy to detect, and to describe asymptotically exactly with the BP equations, if we know the 
true group assignment. Thus the BP analysis is still asymptotically exact for the purpose of analysis of the phase 
diagram. However, in the situation we care about, where the original assignment is not known, this phase transition 
poses an algorithmic problem. We explain this in detail in Section [IVI 
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B. Belief propagation algorithm for inferring the group assignment 

We present here the belief propagation algorithm for inferring the group assignment, and for estimating the free 
energy, the marginals, and the overlap. 

BP-inference (g, n a , c au , Ay , criterium, t max ) 

1 Initialize randomly g-component normalized vector {ip 1 ^ 1 } for each edge 

2 For each node i compute message ip*. according to ([^5)1 ; 

3 Compute the g-component auxiliary field h t according to ([27|k 

4 conv <— criterium + 10; t <— 0; 

5 while conv > criterium and t < t max : 



6 do conv <— 0; t <— t + 1; 

7 for every message i/) 1 "^ (in random order): 

8 do Update all g-components of ^ i_> - J according to (|26l) ; 

9 conv <- conv + - ^.ld* \ 

10 Update ipi using the new value of i/' 1- *'- 7 and (|28p: 

11 Update the field h by subtracting the old xjfl and adding the new value (|27p ; 



12 Compute free energy according to cqs. (f29l |32|) ; 

13 return free energy 

14 return messages {V^ 3 } 

15 return group assignment q* = argmax g ?/>* 

16 return overlap (fl4|) computed with i/j = 

The main cycle of the algorithm takes O(N) time. Specifically, 2M = ciV messages need to be updated, and each 
such update takes 0(c) = 0(1) time operations. The number of iterations needed for the algorithm to converge is, 
in general, a more complicated question. However, at the right value of the parameters 9, the Nishimori condition 
ensures that BP will converge to a fixed point in a constant number (t max ) of steps, so that the total running time of 
the algorithm is O(N). This is illustrated in Fig. [51 A word of caution is that if the parameters 9 are not equal to the 
correct ones, the algorithm might not converge, and indeed sometimes does not. But even in that case, the messages 
after i max iterations can provide useful information. 

To conclude, we summarize some properties of the algorithm. At the correct parameters, the BP algorithm works 
in time linear in the size of the network, and is typically much faster than an equivalent Gibbs sampling algorithm. 
Its superiority with respect to other community detection algorithms lies in the fact that, in addition to finding the 
group assignment that maximizes the overlap with the original assignment, it provides the marginal probabilities that 
each node belongs to each group, along with natural measures of the significance of the inferred assignment. 

Of course, here we assumed that we already knew the parameters 9 with which the network was generated. The 
next step is to use the BP algorithm as a subroutine to learn them, and to discuss under what circumstances this 
learning task is possible. 



C. Belief propagation algorithm to learn the parameters 



The Nishimori conditions (fTTl [T^|) that we use for iterative learning can be written in terms of the BP messages as 
(for undirected graphs) 
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where Z y is defined in (|29|). Therefore, BP can also be used to learn the optimal parameters as follows. 
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BP-LEARNING (q, < n!t , C^* , A (J Crit infcr , Crit loarll ) 

1 n a «- < nit , c a& «- cgf*; 

2 conv <— criticam + 10; 

3 while conv > criti oarn : 

4 do BP-inference (g, n OJ c ab , A ij: crit infer ) 

5 Update n and c Q b according to (13~4"H3"5)) ; 

6 C onv <- £ Q |<™ - < w | + £ Qb IC W " 

7 return n a ,c ab ; 

8 return free energy; 

This is an expectation-maximization (EM) learning algorithm [l6j . where we use BP for the expectation step. The 
update on line [5] can be also done using Gibbs sampling, but BP is much faster at computing the marginals. The 
number of iterations needed for the EM algorithm to converge is constant in the size of the system. However, it 
generally only converges to the correct from some finite fraction of the possible initial parameters 9 mit . In practice, 
several initial values 8 lmt need to be tried, and the fixed point with the smallest final free energy is the correct one. 

When the learning process is possible, this algorithm learns the group sizes n a and the affinity matrix c a b exactly 
in the limit N — » oo. To learn the number of groups q, we run BP-learning for different values of q and find the 
smallest q* such that the free energy density does not decrease further for larger q. 

IV. PHASE TRANSITION IN INFERENCE AND LEARNING 

In this section we will limit ourselves to a particularly algorithmically difficult case of the block model, where the 
graph is undirected and every group a has the same average degree c: 

Q 9 

^2 c ad n d = ^ c bdn d = c , for all a, 6. (36) 

d=l d=l 

If this is not the case, we can achieve a positive overlap with the original group assignment simply by labeling nodes 
based on their degrees, as we will briefly discuss in Section IIV CI We call a block model satisfying (|3"6f a factorized 
block model, and explain the reason for this name in the next paragraph. Note that this case includes both the planted 
partitioning and the planted (noisy) coloring problem discussed in Section III Al 

The first observation to make about the belief propagation equations (f2"tj|) in the factorized block model is that 

C = "ti (37) 

is always a fixed point, as can be verified by plugging (|37[) into (|26[) . In the literature, a fixed point where messages 
do not depend on the indexes i,j is called a factorized fixed point, hence our name for this case of the block model. 
The free energy density at this fixed point is 

/factorized = - (1 ~ log c) . (38) 

For the factorized fixed point we have = n ti , in which case the overlap (TT41 is Q = 0. This fixed point does 
not provide any information about the original assignment — it is no better than a random guess. If this fixed point 
gives the correct marginal probabilities and the correct free energy, we have no hope of recovering the original group 
assignment. For which values of q and c a b is this the case? 

A. Phase transitions in community detection 

We will first study the result given by the cavity method in the thermodynamic limit in the case when the parameters 
q, {n a }, {c a b} used to generate the network are known. 

Fig.[T]represents two examples where the overlap Q is computed on a randomly generated graph with q groups of the 
same size and an average degree c. We set c aa — c- m and c a & = c ou t for all a ^ b and vary the ratio e = c ou t/cin- The 
continuous line is the overlap resulting from the BP fixed point obtained by converging from a random initial condition 
(i.e., where for each i,j the initial messages ipl^ 3 are random normalized distributions on tj). The convergence time 
is plotted in Fig. [5] The points in Fig. Q] are results obtained from Gibbs sampling, using the Metropolis rule and 



13 



obeying detailed balance with respect to the Hamiltonian (|5J), starting with a random initial group assignment {qi}. 
We see that Q = for c out /ci n > e c . In other words, in this region both BP and MCMC converge to the factorized 
state, where the marginals contain no information about the original assignment. For c ou t/c- m < £c, however, the 
overlap is positive and the factorized fixed point is not the one to which BP or MCMC converge. 

In particular the right-hand side of Fig. [T] shows the case of q = 4 groups with average degree c = 16, corresponding 
to the benchmark of Newman and Girvan Q. We show the large N results and also the overlap computed with 
MCMC for size N = 128 which is the commonly used size for this benchmark. Again, up to symmetry breaking, 
marginalization achieves the best possible overlap that can be inferred from the graph by any algorithm. Therefore, 
when algorithms are tested for performance, their results should be compared to Fig. Q] instead of to the common but 
wrong expectation that the four groups are detectable for any e < 1. 




FIG. 1: (color online): The overlap <(Sj between the original assignment and its best estimate given the structure of the graph, 
computed by the marginalization (|13[) . Graphs were generated using N nodes, q groups of the same size, average degree c, and 
different ratios e = Cout/cin. Thus e = 1 gives an Erdos-Renyi random graph, and e = gives completely separated groups. 
Results from belief propagation (|26[) f° r large graphs (red line) are compared to Gibbs sampling, i.e., Monte Carlo Markov 
chain (MCMC) simulations (data points). The agreement is good, with differences in the low-overlap regime that we attribute 
to finite size fluctuations. On the right we also compare to results from the full BP (|22|l and MCMC for smaller graphs with 
N = 128, averaged over 400 samples. The finite size effects are not very strong in this case, and BP is reasonably close to the 
exact (MCMC) result even on small graphs that contain many short loops. For N — >■ oo and e > e c = (c— \/c)/[c+ \/c(q — 1)] it 
is impossible to find an assignment correlated with the original one based purely on the structure of the graph. For two groups 
and average degree c = 3 this means that the density of connections must be e^ 1 (q = 2, c = 3) = 3.73 greater within groups 
than between groups to obtain a positive overlap. For Newman and Girvan's benchmark networks with four groups (right), 
this ratio must exceed 2.33. 



Let us now investigate the stability of the factorized fixed point under random perturbations to the messages when 
we iterate the BP equations. In the sparse case where c a f, = O(l), graphs generated by the block model are locally 
treelike in the sense that almost all nodes have a neighborhood which is a tree up to distance O(logiV), where the 
constant hidden in the O depends on the matrix c a b- Equivalently, for almost all nodes i, the shortest loop that i 
belongs to has length O(logiV). Consider such a tree with d levels, in the limit d — > oo. Assume that on the leaves 
the factorized fixed point is perturbed as 

i>\ = rn + e k , (39) 

and let us investigate the influence of this perturbation on the message on the root of the tree, which we denote kg. 
There are, on average, c d leaves in the tree where c is the average degree. The influence of each leaf is independent, 
so let us first investigate the influence of the perturbation of a single leaf kd, which is connected to ko by a path 
kd, kd-i, ■ ■ ■ , k±, kg. We define a kind of transfer matrix 

= nj^-l). (40) 
where this expression was derived from (1261) to leading order in N . The perturbation e k ° on the root due to the 
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FIG. 2: (color online): The number of iterations needed for convergence of the BP algorithm for two different sizes. The 
convergence time diverges at the critical point e c . The equilibration time of Gibbs sampling (MCMC) has qualitatively the 
same behavior, but BP obtains the marginals much more quickly. 



perturbation e t d on the leaf kd can then be written as 



{««}»=!,. 



~d-l 



t;,ti+i 



(41) 



We observe in ([40|) that the matrix T? b does not depend on the index i. Hence (|4T|) can be written as e k ° = T d e kd . 
When d — > oo, T d will be dominated by T's largest eigenvalue A, so e k ° ~ \ d e kd . 

Now let us consider the influence from all c d of the leaves. The mean value of the perturbation on the leaves is 
zero, so the mean value of the influence on the root is zero. For the variance, however, we have 




This gives the following stability criterion, 

cA 2 = 1 . (43) 

For cA 2 < 1 the perturbation on leaves vanishes as we move up the tree and the factorized fixed point is stable. On 
the other hand, if cA 2 > 1 the perturbation is amplified exponentially, the factorized fixed point is unstable, and the 
communities are easily detectable. 

Consider the case with q groups of equal size, where c aa — c; n for all a and c a b = c ou t for all a ^ b. This includes the 
Newman-Girvan benchmarks, as well as planted (noisy) graph coloring and planted graph partitioning. If there are q 
groups, then C; n + (q— l)c out = qc. The transfer matrix T ab has only two distinct eigenvalues, Ai = with eigenvector 
(1,1,..., 1), and A2 = (c; n — c out )/{qc) with eigenvectors of the form (0, . . . , 0, 1,-1, 0, . . . , 0) and degeneracy q — 1. 
The factorized fixed point is then unstable, and communities are easily detectable, if 

|Cm-Co U t|>£?V^. (44) 

The stability condition (|43| is known in the literature on spin glasses as the de Almeida-Thouless local stability 
condition [33] , in information science as the Kesten-Stigum bound on reconstruction on trees [H, , or the threshold 
for census reconstruction [23| , or robust reconstruction threshold ■ 

We observed empirically that for random initial conditions both the belief propagation and the Monte Carlo Markov 
chain converge to the factorized fixed point when cA 2 < 1. On the other hand when cA 2 > 1 then BP and MCMC 
converge to a fixed point with a positive overlap, so that it is possible to find a group assignment that is correlated 
(often strongly) to the original assignment. We thus conclude that if the parameters q, {n a }, {c a b} are known and if 
cA 2 > 1, it is possible to reconstruct the original group assignment. 
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We can estimate the number of assignments that are just as good as the original one, i.e., that have the right group 
sizes and the right number of edges between each pair of groups, using Eq. (|2ip to express the entropy as 

c 1 

s = log q + - log c - — [(q - l)c out log c out + c; n log c in ] . (45) 

We can think of this entropy as a measure of our uncertainty about the group assignment. 

Next, we discuss a situation when the true marginal probabilities are in the thermodynamic limit equal to 

n ti , and the free energy is given by (|38l) . From the first fact it follows that the graph contains zero (or infinitesimally 
small) information about the original assignment; i.e., the overlap of the marginalized assignment with the original 
one is zero. To understand how is this possible even if e ^ 1, note that from the expressions for the free energy, it 
follows that the network generated with the block model is thermodynamically indistinguishable from an Erdos-Renyi 
random graph of the same average degree. For the coloring problem where c aa = and (q — l)c a b = qc for all a ^ b, 
this was proved in [4l| and discussed under the name "quiet planting" in (l3| . 

The main line of reasoning in the proof of [4l[ goes as follows. First note that the free energy ([55)) is equal to the 
annealed free energy 

/ ann = - lim -j=log[Z] G> (46) 

where [-]g denotes the average over the graphs. Then consider the following thought experiment. First fix the 
parameters q, {n a }, {c a b}', then list in columns all the group assignments with group sizes N a = n a N; and list in 
rows all graphs with M = cN/2 edges. Mark each (graph, assignment) pair (G, {qt}) with the property that G with 
assignment {qi} has the correct number of edges, c a bn a nbN, between each pair of groups. 

Now consider two ways to choose these pairs. The block model corresponds to choosing a random assignment 
(column), and then choosing a random graph (row) consistent with it. In contrast, we can start by choosing a random 
graph (row), and then choose an assignment (column) consistent with the block model. If there are exactly the same 
number of marked pairs in every row and column, these two procedures are the same. In that case, it would be 
impossible to distinguish a graph generated by the block model from an Erdos-Renyi graph. 

It is certainly true that for every group assignment (with fixed group sizes) there are the same number of compatible 
graphs. But different graphs have different numbers of compatible assignments. Thus the number of marked pairs is 
different in different rows. The number of marked pairs in each row is essentially the partition function Z. Now, if 
all but an exponentially small fraction of graphs have the same typical properties as a random graph (this is a large 
deviation principle), it follows that also the block model generates typical random graphs as long as the quenched 
free energy equals the annealed one, i.e., when limjv_ j . 00 log [Z]g/N = limjv-».oo[log Z]g/N. 

In the example presented in Fig. [TJ both BP and MCMC confirm that the free energy ([55]) is the correct free energy 
and that the marginals are uniform, z/j(a) = n a , for e > e c . The only possibility known in statistical physics when 
MCMC does not give the correct free energy is ergodicity breaking on the time-scale during which the MCMC was 
performed (i.e. insufficient equilibration time). Indeed, the original group assignment could belong to a part of the 
phase space that dominates the Boltzmann distribution, but that is invisible to dynamics on time-scales linear in the 
size of the system. Such a situation indeed happens in systems that undergo the ideal glass transition. There is a 
simple trick to verify if such a glass transition appears or not in the block model: just run BP or MCMC using the 
original group assignment as the initial condition. 

When we do this for the examples shown in Fig. [TJ there is absolutely no change in the result. Thus the free energy 
and overlaps presented in Fig. [TJ are asymptotically exact and we can distinguish two phases: 

• If I c nl — c ou t | < q\fc, the graph does not contain any significant information about the original group assignment, 
and community detection is impossible. 

• If | c; n — Co U t | > q\fc, the graph contains significant information about the original group assignment, and using 
BP or MCMC yields an assignment that is strongly correlated with the original one. There is some intrinsic 
uncertainty about the group assignment due to the entropy, but if the graph was generated from the block model 
there is no better method for inference than the marginalization introduced by Eq. (|13p . 

Fig. [T] hence illustrates a phase transition in the dctcctability of communities. Unless the ratio c ou t/cin is far enough 
from 1, the groups that truly existed when the network was generated are undetectable from the topology of the 
network. Moreover, unless the condition (|44p is satisfied the graph generated by the block model is indistinguishable 
from a random graph, in the sense that typical thermodynamic properties of the two ensembles are the same. 

The situation illustrated in Fig. [JJ is, however, not the most general one. Fig. [3] illustrates the case of planted 
coloring with q — 5, c m = 0, and c ou t = qc/{q — 1). In this case the condition for stability (|33|) leads to a threshold 
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FIG. 3: (color online): Left: graphs generated with q = 5, c ln = 0, and N = 10 5 . We compute the overlap ((5} and the free 
energy with BP for different values of the average degree c. The green crosses show the overlap of the BP fixed point resulting 
from using the original group assignment as the initial condition, and the blue crosses show the overlap resulting from random 
initial messages. The red stars show the difference between the factorized free energy (|38[) and the free energy resulting from 
the planted initialization. We observe three important points where the behavior changes qualitatively: Cd — 12.84, c c = 13.23, 
and ce = 16. We discuss the corresponding phase transitions in the text. Right: the case q — 10 and c = 10. We plot the 
overlap as a function of e; it drops down abruptly from about Q — 0.35. The inset zooms in on the critical region. We mark 
the stability transition et, and data points for N = 5 ■ 10 5 for both the random and planted initialization of BP. In this case 
the data are not so clear. The overlap from random initialization becomes positive a little before the asymptotic transition. 
We think this is due to strong finite size effects. From our data for the free energy it also seems that the transitions e c and 
are very close to each other (or maybe even equal, even though this would be surprising). These subtle effects are, however, 
relevant only in a very narrow region of e and are, in our opinion, not likely to appear for real-world networks. 



value ce = (q — l) 2 . We plot again the overlap obtained with BP, using two different initializations: the random one, 
and the planted one corresponding to the original assignment. In the latter case, the initial messages are 

4T j = Ku , (47) 

where ti is the original assignment. We also plot the corresponding BP free energies. As the average degree c increases, 
we see four different phases in Fig. [3] 

I. For c < Cd, both initializations converge to the factorized fixed point, so the graph does not contain any significant 
information about the original group assignment. The ensemble of assignments that have the proper number 
of edges between each pair of groups is thermodynamically indistinguishable from the uniform ensemble. The 
original assignment is one of these configurations, and there is no possible way to tell which one it is without 
additional knowledge. 

II. For Cd < c < c c , the planted initialization converges to a fixed point with positive overlap, and its free energy 
is larger than the annealed free energy. In this phase there are exponentially many basins of attraction (states) 
in the space of assignments that have the proper number of edges between each pair of groups. These basins 
of attraction have zero overlap with each other, so none of them yield any information about any of the others, 
and there is no way to tell which one of them contains the original assignment. The annealed free energy is still 
the correct total free energy, the graphs generated by the block model are thermodynamically indistinguishable 
from Erdos-Renyi random graphs, and there is no way to find a group assignment correlated with the original 
one. 

III. For c c < c < a, the planted initialization converges to a fixed point with positive overlap, and its free energy is 
smaller than the annealed free energy. There might still be exponentially many basins of attraction in the state 
space with the proper number of edges between groups, but the one corresponding to the original assignment 
is the one with the largest entropy and the lowest free energy. Therefore, if we can perform an exhaustive 
search of the state space, we can infer the original group assignment. However, this would take exponential 
time, and initializing BP randomly almost always leads to the factorized fixed point. In this phase, inference is 
possible, but exponentially hard; the state containing the original assignment is, in a sense, hidden below a glass 
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transition. Based on the physics of glassy systems, we predict that no polynomial-time algorithm can achieve a 
positive overlap with the original group assignment. 

IV. For c > C£, both initializations converge to a fixed point with positive overlap, strongly correlated with the 
original assignment. Thus inference is both possible and easy, and BP achieves it in linear time. Indeed, in 
this easy phase, many efficient algorithms will be able to find a group assignment strongly correlated with the 
original one. 

We also investigated the case q — 5, Cj n = 0, illustrated in Fig. |3l with Gibbs sampling, i.e., the Markov chain 
Monte Carlo algorithm. For the planted initialization, its performance is generally similar to BP. For the random 
initialization, MCMC agrees with BP only in phases (I) and (IV). It follows from results on glassy systems [42j that 
in phases (II) and (III), the equilibration time of MCMC is exponentially large as a function of N, and that its 
performance in linear time, i.e., CN for any constant C, does not yield any information about the original group 
assignment. 

The boundaries between different phases correspond to well-known phase transitions in the statistical physics of 
spin glasses. Specifically, a is the dynamical transition or reconstruction threshold, see e.g. [2^, [2^] - The detectability 
threshold c c corresponds to the condensation transition or the Kauzmann temperature. Finally, ci is the easy/hard 
transition in planted models introduced in [l3j . There is also a close relation between our approach and optimal finite 
temperature decoding [lj| [27M29| and the statistical mechanics approach to image processing [3(J ■ 

We saw in our experiments, consistent with the reconstruction thresholds Cd and the Kesten-Stigum bound q in ['2'A . 
that for assortative communities where Ci n > c ou t, phases (II) and (III) are extremely narrow or nonexistent. For 
q < 4, these phases do not exist, and the overlap grows continuously from zero in phase (IV), giving a continuous phase 
transition as illustrated in Fig.[T] For q > 5, phases (II) and (III) occur in an extremely narrow region, as shown on the 
right in Fig. [3] The overlap jumps discontinuously from zero to a relatively large value, giving a discontinuous phase 
transition. In the disassortative (antiferromagnetic) case where c m > c ou t, phases (II) and (III) are more important. 
For instance, when c; n = and the number of groups is large, the thresholds scale as Cd ~ qlogq, c c « 2qlog<7 and 
C£ = (q — l) 2 . However, in phase (III) the problem of inferring the original assignment is as hard as finding a solution 
to a random Boolean satisfiability problem close to the satisfiability threshold, or coloring a random graph close to the 
q-colorability threshold. These are NP-complctc problems, and are believed to be exponentially hard in this region. 

In phase (IV), i.e., when (|4"4")l is satisfied, inference is easy in linear time with both BP and MCMC, and likely with 
many other community detection algorithms in the literature (at least in the assortative case, which has received the 
most attention). But, at least for networks generated by the block model, there is very little space for algorithmic 
improvement: either the inference problem is easy (in linear time), or exponentially hard, or impossible. Our results 
suggest that there is no middle ground where inference is possible but requires time, say, 0(N°) for some c > 1. 

As far as we know, the phase transitions presented here are previously unknown in the literature on community 
detection. However, the phase transition in detectability was predicted by the authors of They used an ap- 

proximate (replica symmetric) calculation of the ground state energy of the Hamiltonian ((5J) on networks created by 
the block model, and noticed that it differs from its value on random graphs only when c ou t/ci n is sufficiently small. 
Based on this, they predicted an undetectable region if the probability that a random edge connects two nodes of the 
same community is smaller than a critical value p c . Our exact calculation f|44[) leads to p c = [c+ (q — l)^/c]/(qc) when 
c in > c ou t, showing that [43[ overestimated the size of the undetectable region; this also explains the discrepancy with 
their numerical data. The problem with their calculation is that the ground state energy of © cannot be computed 
correctly using the replica symmetric approximation, and that the ground state does not maximize the overlap with 
the original assignment in any case. In contrast, we focus on the free energy, and our calculations are exact in the 
thermodynamic limit. In addition, [43| only treated the cases in Fig. [1] and did not encounter the case when the 
detectability transition is discontinuous. 

Another related work that numerically investigated the behavior of a Potts-like Hamiltonian J5J with general 
parameters on networks generated from the block model is [44j]. The principal difference between |44| and our work 
is that they do not focus on the parameters of the Hamiltonian with which the network was generated, or discuss 
optimal inference of the group assignment with those parameters. 

B. Phase transitions in parameter learning 

In this section we will continue to focus on the special case of the parameters defined by (|36|) . where all groups have 
the same average degree. We will, however, no longer assume that the correct values of the parameters q, {n a } and 
{Cab} are known; now our goal is to learn them. 

In Fig. SI we generate graphs with q = 2 groups of the same size, N = 10 5 , with average degree c, and Cn = C22 = Cj n 
and C12 = c ou t where e* = c ou t/ci n = 0.15. We then compute the free energy as a function of e. The left-hand side of 
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Fig. [4] shows the factorized free energy (|38|) minus the free energy obtained from the BP fixed point as a function of e. 
As expected, this curve is maximized at the correct value e = e* . The learning procedure searches for this maximum. 

Note that the initial values of the parameters are important. For e > e s = 0.36, the free energy equals the factorized 
one, and BP converges to the factorized fixed point. Hence if we start the learning process with e > e s , the local slope 
of the free energy will not push us in the correct direction. 

The right-hand side of Fig. |4] shows (red crosses) the corresponding values of the overlap between the marginalized 
group assignment and the original one. We see that this overlap is maximized at the correct value e = e*. We also 
plot the estimated overlap (Eq. (|14[) . green crosses), which is based only on the marginals with no knowledge of the 
original group assignment. At e = e*, it indeed equals the true overlap with the original assignment. 



0.025 
0.02 
0.015 
I 0.01 

CD 

| 0.005 


-0.005 
-0.01 



"W+'factorized + """ 
q=2, c=3 



e*=0.15 



I I I I I I I I I I I I I 



0.1 0.2 0.3 0.4 

£ = C n ,|t/Cj n 



0.5 



0.8 



0.6 



0.4 



0.2 



overlap actual — I — 
overlap marginalization — H— 

q=2, c=3 




0.1 0.2 0.3 









e*=0.15 \ 



0.66 
0.65 
0.64 
0.63 
0.62 



0.1 



0.2 0.3 



0.4 



0.5 



FIG. 4: (color online): Learning for graphs of N = 10 5 nodes with q = 2 groups, average degree c = 3, and e* = c out /ci n = 0.15. 
Left: the BP free energy as a function of e. Specifically, we plot the factorized free energy (which is independent of e) minus 
the BP free energy. As we expect, the maximum is achieved at e = e*. Our learning procedure looks for this maximum via a 
kind of expectation-maximization (EM) algorithm. Note that for e > e s = 0.36 the BP free energy is equal to the factorized 
one, so we need to initialize the learning process somewhere in the region e < e s . Right: the overlap ((5| between the original 
group assignment and the best estimate using BP marginalization, compared to the estimated overlap (|14|) . They are equal 
only at the correct parameters, e = e* . In the inset we see that the actual overlap is maximized at e = e*, illustrating that to 
infer the group assignment optimally one needs to have the correct parameters. 



Fig. [5] uses graphs generated in the same way as in Fig. 21 For each e we compute the averages (fl~MT9| from the 
BP fixed point. In terms of the BP messages, the most likely values of the parameters are then, as in 



q 2 ^ Cout (vrvr+cvo (48) 



- T 



(49) 



The learning process iteratively updates c ln and c ou t (more generally, the affinity matrix c a b) and looks for a fixed 
point where e' = e. As we said above, this is essentially an expectation-maximization (EM) algorithm, where we use 
BP to approximate the expectation step. 

In Fig. [5] we plot e' = c' out /c[ n as a function of e. We see that e* is the only fixed point in its vicinity. However, 
every e > e s is also a fixed point due to the factorized BP fixed point, again showing that we need to initialize the 
learning process at some e < e s . 

On the right-hand side of Fig. [S] we depict the region in the (e, c) plane in which the learning process converges to e* 
if we start it at e. We see that learning is possible for c > q = 1.83, where q was obtained from (|44|) by considering 
e* = 0.15. But even in this region c > ct one should not start with too large a value of e. It is better to start with 
e = (i.e., completely separated groups) rather than with e = 1 (an undifferentiated random graph). The same is 
true in the antiferromagnetic case, i.e., if c ou t > c m , if we define e = Ci n /c ou t- 

In general we conclude that the phase transitions for inference (i.e. when parameters are known) are present 
also in learning. Whenever inference is possible the asymptotically correct parameters can be learned with a proper 
initialization. The set of good initial values of the parameters always takes a finite fraction of the all possible initial 
values (hence finding good initialization takes a finite number of steps). 
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FIG. 5: (color online): Left: learning on graphs generated with q = 2, c = 3, and e* = c out /ci n = 0.15. For each e we compute 
the averages in (|18H19[) from the BP fixed point, update c ou t and ci n accordingly, and plot the new ratio e' = c' out /c' in as a 
function of e. This process converges to e* if we initialize it at e < e s : in contrast, every e > e s is an (incorrect) fixed point. 
Right: the shaded region illustrates the initial values (e, c) of the parameters from which the learning process converges to e*. 
Learning is possible for c > Cf, where a is given by (|44[) . Graph generated with q = 2, c = 3, e* = 0.15, and different values of 
average degree c. BP is run with e^e', for e < e conv the BP does not converge. The magenta line corresponds to the largest 
e*(c), given by (|44f) . for which communities are detectable at a given average degree c. 



If the group sizes are unknown, we can learn them in a similar manner, updating them using f|34[) . On the other 
hand, learning the number of groups requires a different approach. Fig. [6] shows the dependence of the free energy 
on q, for an example where the correct number of groups is q* = 4. If q > q*, there are multiple assignments where 
q — q* groups are empty, so the free energy is not maximized at q* . Instead, the free energy grows as long as q < q* , 
and then stays constant for q > q* . To learn the correct number of groups we thus need to run the algorithm for 
several values of q and select the q at which the free energy stops growing. 

It is instructive to discuss the parameter values that achieve the maximum free energy when q > q*. These are for 
instance group sizes where q — q* groups are empty. But there is, in general, a continuum of other fixed points of the 
learning process with the same free energy; for instance, where one group is divided into two in an arbitrary way. The 
learning process converges to one of these fixed points, and we have not found a way to determine q* more directly 
than running BP with a lower value of q and comparing the free energies. We stress that this method of learning the 
number of groups is asymptotically exact for networks generated by the block model. However, for real networks the 
free energy of the block model does not generally saturate at any finite q, as we will discuss in the next section. 
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FIG. 6: (color online): The (negative) free energy for a graph generated with q* = 4 groups, average degree c = 16, e = 0.2, 
and iV = 10 4 . We run BP for various values of q and plot the (negative) free energy. The correct number of groups is the q at 
which the free energy saturates. 
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C. When groups have unequal average degree 

In the previous section we studied the asymptotic behavior of the stochastic block model in the case (|36p where 
every group has the same average degree. In that case, the degree sequence does not contain any information about 
which node should be in which group, and phases exist in parameter space where the original group assignment cannot 
be inferred from the structure of the network. 

If (136)) is not satisfied, each group a has average degree c a depending on a. In that case, the undetectable phase does 
not exist, since classifying nodes according to their degree yields a nonzero overlap with the original group assignment. 
Our procedure for optimal inference and parameter learning described in Sections [II Bl and I II CI still holds here, as do 
the BP algorithm and its asymptotic analysis. Section [IIII can be used to infer the original assignment and the model 
parameters, and these algorithms are asymptotically exact. In the generic case where the group average degrees are 
all distinct, they can be learned exactly in the thermodynamic limit, even if the differences between them are small. 

The phase transitions in the inference problem described in Section IIV Al however, still exist and can again be 
investigated using the cavity method, although the condition (|44|) for easy inference does not have a simple analytic 
form anymore. In the case where the detectability phase transition is discontinuous, we can again analyze the phase 
diagram by considering BP with the planted and random initializations. As we travel along some curve through 
the space of parameters {n a }, {c a b}, the transition a corresponds to the point where the two initializations start to 
converge to two different fixed points, c c (the detectability transition) to the point at which their free energies become 
equal, and finally ct (the hard/easy transition) to the point where they both converge to the same fixed point, which 
is strongly correlated with the original assignment. In the case where the phase transition is continuous, there is just 
one transition, where the BP convergence time diverges as in Fig. [5] 

V. PERFORMANCE ON REAL- WORLD NETWORKS 

We tested our inference and learning algorithms on several real- world networks. We present our findings on two 
particular examples: Zachary's karate club [45| and a network of books on politics. The purpose of this discussion 
is not to argue that our algorithms outperform other known methods; both these networks are small, with easily- 
identifiable communities. Rather, our point is that our algorithms provide a quantitative comparison between these 
real-world networks and those generated by the stochastic block model. More generally, our techniques allow us to 
quantitatively study the extent to which a network is well-modeled by a given generative model, a study that we feel 
more work should be devoted to in the future. 

First let us make a remark about our algorithm's performance on synthetic benchmarks that are generated by the 
stochastic block model. The results on the right-hand side of Fig. [1] correspond to the four-group networks of Newman 
and Girvan Q, that have been used as benchmarks for many algorithms in the literature. Up to symmetry breaking, 
the overlap with the original group assignment shown in Fig. [T] is the best that can be achieved by any inference 
algorithm, and both MCMC (Gibbs sampling) and the BP algorithm achieve this optimum in linear time. Thus the 
right way to measure the performance of a community detection algorithm on these networks is to compare their 
results to Fig. [TJ This holds also for more general synthetic benchmarks generated by the block model, like those 
in HH. 

A. Zachary's karate club 

Zachary's karate club [45j is a popular example for community detection. It consists of friendships between the 
members of a karate club which split into two factions, one centered around the club president and the other around 
the instructor. It has 34 nodes and 78 edges. We ran the BP learning algorithm on this network with q = 2 groups. 
Depending on the initial parameters {n a }, {c a b}, it converges to one of two attractive fixed points in parameter space: 

(l) _ / 0.525 \ (<) _ / 8.96 1.29 \ 
^ 0.475 J ' ^ 1.29 7.87 J ' 

B W - 16.97 12.7 \ 

n ~ \0.U6 J ' c - ^ 12.7 1.615 J ■ 

For comparison, we also performed learning using MCMC for the expectation step; this network is small enough, 
with such a small equilibration time, that MCMC is essentially exact. We again found two attractive fixed points in 
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parameter space, very close to those in ([50 



(») 

l MC 



0.52 
0.48 



„(*) 

-MC 



0.85 
0.15 



-MC 



8.85 
1.26 

16.58 
12.52 



1.26 
7.97 

12.52 
1.584 



(51) 



A first observation is that even though Zachary's karate club is both small and "loopy," rather than being locally 
treelike, the BP algorithm converges to fixed points that are nearly the same as the (in this case exact) MCMC. This 
is despite the fact that our analysis of the BP algorithm assumes that there are no small loops in the graph, and 
focuses on the thermodynamic limit N — > oo. This suggests that our BP learning algorithm is a useful and robust 
heuristic even for real-world networks that have many loops. 
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FIG. 7: (color online): On the left: the partitioning of Zachary's karate club found by our inference algorithm using the first 
fixed point, (i) in (|50p . The colors indicate the two groups found by starting with an assortative initial condition, i.e., where 
cn,C22 > C12. The shades represent the marginal probabilities: a white node belongs to both groups with equal probability, 
whereas a node that is solid red or solid blue belongs to the corresponding group with probability 1. Most of the nodes are 
strongly biased. The xs show the five nodes that are grouped together by the second fixed point, (ii) in (|50[) . which divides 
the nodes into high-degree and low-degree groups rather than into the two factions. On the right: the negative free energy 
for parameters interpolating between the two fixed points, with (i) at t = and (ii) at t = 1. The two fixed points are local 
maxima, and each one has a basin of attraction in the learning algorithm. As noted in the high-degree/low-degree fixed 
point actually has lower free energy, and hence a higher likelihood, in the space of block models with q = 2. The horizontal lines 
show the largest values of the likelihood that we obtained from using more than two groups. Unlike in Fig. [5] the likelihood 
continues to increase when more groups are allowed. This is due both to finite-size effects and to the fact that the network 
is not, in fact, generated by the block model: in particular, the nodes in each faction have a highly inhomogeneous degree 
distribution. 



Fig. [7] shows the marginalized group assignments for the division into two groups corresponding to these two fixed 
points. Fixed point (i) corresponds to the actual division into two factions, and cj'} has assortative structure, with 
larger affinities on the diagonal. In contrast, fixed point (ii) divides the nodes according to their degree, placing high- 
degree nodes in one group, including both the president and the instructor, and the low-degree nodes in the other 
group. Of course, this second division is not wrong; rather, it focuses on a different kind of classification, into "leaders" 
on the one hand and "students/followers" on the other. On the right side of Fig.[7]we plot the negative free energy ([52")) 

achieved by interpolating between the two fixed points according to a parameter i, with c a b(t) — (1 — t)c^} +tc^ and 
similarly for n a . We see that the two fixed points correspond to two local maxima, the second (ii) being the global 
one. Thus if we assume that the network was generated by a block model with q = 2, the second fixed point is the 
more likely division. 

As recently pointed out in [8j, the block model we study in this paper does not fit Zachary's network particularly 
well. This is because the nodes in each faction are not equivalent to each other. In particular, the "hubs" or "leaders" 
of each faction have significantly higher degrees than the other nodes do; in our block model this is unlikely, since the 
degree distribution within each group is Poisson. The authors of H show that we can obtain a better classification, 
in the sense of being closer to the two factions, using a degree- corrected block model that takes this inhomogeneous 
degree distribution into account. Happily, as we will discuss in future work, our BP approach and learning algorithm 
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generalizes easily to these degree corrected block models. Under the degree-corrected block model the factional division 
(i) does indeed become the most likely one. 




FIG. 8: (color online): On the left, the marginalized group assignment for Zachary's karate club with q = 4. We again use the 
shade to indicate how strongly biased the marginal is, up to white if it is 1/q. Almost all nodes are strongly biased. The white 
and dark grey regions correspond to the two factions, and within each group we have high- and low-degree nodes. Thus our 
algorithm finds a partition that divides nodes according to both their faction and their degree. On the right, we compare the 
negative free energy (i.e., the likelihood) as a function of the number of groups for Zachary's network and a synthetic network 
with the same parameters as the q — 4 fixed point. The free energy levels off at q = 4 for the synthetic network, but not as 
sharply as it does in the thermodynamic limit (see Fig. |6|. For Zachary's network, the free energy continues to improve as q 
increases, due to further inhomogeneities within the groups. 



As discussed in Section IIVB[ our methods allow us to learn the correct number of groups q* for large networks 
generated from the block model by comparing the best free energy achieved for different q. In the thermodynamic 
limit, the free energy becomes constant when q > q* as shown in Fig. [B] However, for real networks the situation is 
less simple. As shown on the right of Fig. for Zachary's network our algorithm finds fixed points with decreasing 
free energy (i.e., increasing likelihood) as q increases. On the left side of Fig. |8l we show the marginalized group 
assignment we found for q — 4. As we would expect given the fact that each faction contains both high- and low- 
degree nodes, the four-group classification separates nodes according to both their faction and their degree, dividing 
each faction into leaders and followers. As q increases further, so does the likelihood, due to the fact that there are 
further inhomogeneities within the groups, i.e., further deviations from the block model. Continuing to subdivide the 
groups leads to a hierarchy of groups and subgroups as in |6[ , for instance separating peripheral nodes in each faction 
from those that are directly connected to the hubs. 

However, even for networks generated by the block model, there are finite-size effects that make learning the number 
of groups difficult. To separate these finite-size effects from the inhomogeneities in Zachary's network, we used the 
block model to generate synthetic networks, using the parameters that our algorithm learned for Zachary's network 
with q = 4. On the right side of Fig. [8] we show the negative free energy obtained by our algorithm on the real 
and synthetic networks for various values of q. For the synthetic networks, the likelihood levels off to some extent 
when q > q* , but does not become constant. Thus these finite-size effects explain some, but not all, of the likelihood 
increase observed for the real network as q increases. 



B. A network of political books 



The other real network we discuss here is a network of political books sold on Amazon compiled by V. Krebs 
(unpublished). These books are labeled according to three groups, liberal, neutral, or conservative. Edges represent 
co-purchasing of books by the same customer as reported by Amazon. Running our learning BP algorithm with q = 3 
yields a group assignment with an overlap of Q = 0.74. The parameters learned by our algorithm, and the most-likely 
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parameters given the original labeling, are respectively 



18.0 2.7 

n w = ( 0.39 | , c (l) = ( 2.7 21.2 0.15 ) (52) 

2.1 0.15 



/ 0.24 \ 




0.39 




\ 0.37 / 




0.13 \ 




0.46 , 


^actual — 


0.41 / 






12.1 5.6 4.5 

^actual = ( 0.46 ] , Cactual = ( 5.6 17 0.6 | . (53) 

4.5 0.6 20 7 

The marginalized group assignment for q = 3 our algorithm finds using the learned parameters , n$ is shown 
in Fig. [9l The letters correspond to the actual group assignment for nodes where our assignment disagrees with 
Krebs' labels. The groups corresponding to liberal and conservative books agree very well with his labels, and are 
well-separated with very few links between them. However, the middle group found by our algorithm is a mixture of 
the three types of books. It could be that the original labels of these books are misleading, or that these particular 
books appeal to customers who buy books from all three parts of the political spectrum. 




FIG. 9: (color online): The marginalized group assignment with q = 3 for the political book network, with liberal, neutral, 
and conservative books labeled red, green, and blue respectively. The letters R, G, and B indicate the original labels where 
they disagree with our algorithm. As for Zachary's network, the likelihood increases as we increase the number of groups. For 
q = 5, running our algorithm with q = 5 subdivides the red and blue groups into subcommunities as shown. 

We also ran our algorithm with q ^ 3 groups. Using q — 2 distributes the neutral books in the two larger groups. 
For q = 5, the liberal and conservative groups each break into two subgroups as shown in Fig. [SI which consist mainly 
of high-degree and low-degree nodes within each of these groups. This structure can be observed in the corresponding 
affinity matrix 
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where group 1 consists of the neutral books, groups 2 and 3 are the low- and high-degree liberal books, and 4 and 5 are 
the low- and high-degree conservative books. However, as for Zachary's karate club, the likelihood keeps increasing 
for larger q, suggesting a hierarchy of groups or subgroups, or simply that the block model is not the right generative 
model for this network. 



VI. CONCLUSIONS 



We analyzed the thermodynamic properties of networks generated by the stochastic block model, focusing on the 
questions of how to optimally infer the original group assignment, and learn the parameters of the model, from 
the topology of the generated graph. Using the cavity method we provided an asymptotically exact answer to 
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these questions, describing the phase diagram and transitions between phases where inference is easy, possible but 
exponentially hard, and impossible. These transitions are closely related to known phase transitions in the mean field 
theory of spin glasses. Remaining open questions include an analysis of finite size effects, and mathematically rigorous 
proofs of our results. 

In the easy phase, our analysis leads to a belief propagation (BP) algorithm that infers the group assignment opti- 
mally, i.e., that maximizes the overlap with the original assignment, and learns the underlying parameters, including 
the correct number of groups in the thermodynamic limit. This algorithm is highly scalable, with a running time 
that is linear in the size of the network. While MCMC sampling also runs in linear time, BP is considerably faster at 
providing the marginal probabilities, which also give a measure of how strongly each node belongs to its group. Our 
methods can detect more general types of functional communities than many other methods of community detection 
algorithms, and also provide measures of significance of the community structure, letting us distinguish purely random 
graphs from those with modular structure. 

While many real-world networks are not well modeled by the type of stochastic block model we study here, our 
analysis and our BP learning algorithm easily generalize to any generative model where the likelihood function ([6]) 
can be written as a product of local terms, such as the degree-corrected block models of Q. We will discuss these 
generalizations in future work. 
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